Abstract

In  this  paper  we  propose  a  partially  hard-wired  Elman  network.  A  distinct  feature 
of  our  approach  is  that  only  minor  modifications  of  existing  on-line  and  off-line 
learning  algorithms  are  necessary  in  order  to  implement  the  proposed  network. 
This  allows  researchers  to  adapt  easily  to  trainable  recurrent  networks.  Given  this 
network  architecture,  we  show  that  in  a  general  dynamic  environment  the  standard 
back-propagation  estimates  for  the  learnable  connection  weights  can  converge  to  a 
mean  square  error  minimizer  with  probability  one  and  are  asymptotically  normally 
distributed. 


1      Introduction

Neural network models have been successfully applied in a wide variety of disciplines.
Typically, applications of networks with at least partially modifiable
interconnection strengths are based on the so-called multilayer feedforward
architecture, in which all signals are transmitted in one direction without feedback.
In a dynamic context, however, a feedforward network may have difficulty
representing certain sequential behavior when its inputs are not sufficient to
characterize temporal features of target sequences (Jordan, 1985). From the cognitive
point of view, a feedforward network can perform only passive cognition, in that its
outputs cannot be adjusted by an internal mechanism when static inputs are present
(Norrod, O'Neill, & Gat, 1987). These deficiencies restrict the applicability
of feedforward neural network models in dynamic environments.

In view of these problems, researchers have recently been studying recurrent
networks, i.e., networks with feedback connections; see, e.g., Jordan (1986),
Elman (1988), Williams & Zipser (1988), and Kuan (1989). In a recurrent network,
recurrent variables compactly summarize the past information and, together with
other input variables, jointly determine the network outputs. Because recurrent
variables are generated by the network, they are functions of the network connection
weights. Owing to this parameter dependence, the standard back-propagation
(BP) algorithm for feedforward networks cannot be applied because it fails to take
the correct gradient search direction (cf. Rumelhart, Hinton, & Williams, 1986).
Kuan, Hornik, & White (1990) propose a recurrent BP algorithm generalizing the
standard BP algorithm to various recurrent networks. However, this algorithm
has quite complex updating equations and restrictions, and therefore cannot be
used straightforwardly by practitioners of recurrent networks.


In this paper we suggest an easier way to implement recurrent networks. We
focus on a variant of the Elman (1988) network, in which only a subset of hidden
unit activations serve as recurrent variables. We propose to hard-wire the
connections between the recurrent units and their inputs. This approach has the
following advantages. First, the resulting network avoids the aforementioned problem
of parameter dependence. Second, the necessary constraints on recurrent connections
suggested by Kuan, Hornik, & White (1990) can easily be imposed by hard-wiring.
Third, off-line learning is made possible for the proposed network. Consequently,
only minor modifications of existing on-line and off-line learning algorithms are
needed. This is very convenient for neural network practitioners. Given this
hard-wired network, we show that in general dynamic environments the resulting BP
estimates converge to a mean squared error minimizer with probability one and are
asymptotically normally distributed. Our convergence results extend the results of
Kuan, Hornik, & White (1990) for general recurrent networks and are analogous
to the results of Kuan & White (1990) for feedforward networks.

This paper proceeds as follows. In section 2 we briefly review recurrent
networks. In section 3 we discuss a variant of the Elman network and its learning
algorithms. We establish strong consistency and asymptotic normality of the
learning estimates in section 4. Section 5 concludes the paper. Proofs are deferred to
the appendix.

2      Recurrent  Networks 

A three layer recurrent network with $k$ input units, $l$ hidden units with common
activation function $\psi$, and $m$ output units with common activation function $\phi$
can be written in the following generic form:

$$o_t = \Phi(W a_t + v)$$

$$a_t = \Psi(C x_t + D r_t + b)$$

$$r_t = G(x_{t-1}, r_{t-1}, \theta),$$

where the subscript $t$ indexes time, $x$ is the $k \times 1$ vector of network inputs, $a$
is the $l \times 1$ vector of hidden unit activations, $o$ is the $m \times 1$ vector of network
outputs, $\Phi$ and $\Psi$ compactly denote the unitwise activation rules in the output
and hidden layers, respectively, and $r_t$ is the $n \times 1$ vector of recurrent variables,
which is computed through some generic function $G$ from the previous input $x_{t-1}$, the
previous recurrent variable $r_{t-1}$, and

$$\theta = [\mathrm{vec}(C)', \mathrm{vec}(D)', \mathrm{vec}(W)', b', v']',$$

the vector of all network connection weights. (In what follows, $'$ denotes transpose,
the vec operator stacks the columns of a matrix one underneath the other, and $|v|$
is the Euclidean length of a vector $v$.)

More compactly, the above network can be written as

$$o_t = \Phi(W \Psi(C x_t + D r_t + b) + v) \qquad (1)$$

$$r_t = G(x_{t-1}, r_{t-1}, \theta). \qquad (2)$$

That is, the network output is jointly determined by the external inputs $x_t$ and
the recurrent variables $r_t$. Clearly, different choices of $G$ yield different recurrent
networks. When $r_t = o_{t-1}$ (output feedback),

$$r_t = G(x_{t-1}, r_{t-1}, \theta) = \Phi(W \Psi(C x_{t-1} + D r_{t-1} + b) + v),$$

and we obtain the Jordan (1986) network. When $r_t = a_{t-1}$ (hidden unit activation
feedback),

$$r_t = G(x_{t-1}, r_{t-1}, \theta) = \Psi(C x_{t-1} + D r_{t-1} + b),$$

and we have the Elman (1988) network.

By recursive substitution, (2) becomes

$$r_t = G(x_{t-1}, r_{t-1}, \theta) = G(x_{t-1}, G(x_{t-2}, r_{t-2}, \theta), \theta) = \cdots =: \mathcal{G}(x^{t-1}, \theta),$$

where $x^{t-1} = (x_{t-1}, x_{t-2}, \ldots, x_0)$ is the collection of past inputs. Hence, $r_t$ is a
complex nonlinear function of $\theta$ and the entire past of $x_t$. In contrast with the external
input $x_t$, we may interpret $r_t$ as an "internal" input, in the sense that it is
generated by the network. Given a recurrent network, the standard BP algorithm for
feedforward networks does not perform correct gradient search over the parameter
space because it fails to take the dependence of $r_t$ on the learnable network weights
into account. Consequently, meaningful convergence cannot be guaranteed (Kuan,
1989).
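The parameter dependence described above can be made concrete with a minimal sketch of the Elman case of (1) and (2). The function names, the choice $\Phi = \Psi = \tanh$, and the initial value $r_0 = 0$ are assumptions made only for this illustration; they are not taken from the paper.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def elman_run(xs, C, D, b, W, v, act=math.tanh):
    """Run o_t = Phi(W a_t + v), a_t = Psi(C x_t + D r_t + b), r_t = a_{t-1}.

    Assumptions for illustration: Phi = Psi = act and r_0 = 0.
    Returns the outputs o_t and the recurrent variables r_t that were used.
    """
    r = [0.0] * len(b)
    outputs, recurrents = [], []
    for x in xs:
        recurrents.append(r)
        Cx, Dr = matvec(C, x), matvec(D, r)
        a = [act(cx + dr + b_i) for cx, dr, b_i in zip(Cx, Dr, b)]
        outputs.append([act(s + v_i) for s, v_i in zip(matvec(W, a), v)])
        r = a  # Elman feedback: hidden activations become the next r_t
    return outputs, recurrents

# Same inputs, two different values of the learnable matrix C.
xs = [[1.0], [0.5], [-0.3]]
D, b, W, v = [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0], [[1.0, 1.0]], [0.0]
_, r_first = elman_run(xs, [[0.5], [-0.2]], D, b, W, v)
_, r_second = elman_run(xs, [[0.6], [-0.2]], D, b, W, v)
```

Here `r_first[1]` and `r_second[1]` differ although the inputs are identical, which is exactly the dependence of $r_t$ on the weights that a feedforward gradient ignores.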

Kuan, Hornik, & White (1990) propose a recurrent BP algorithm which, by
carefully calculating the correct gradients and including additional derivative
updating equations, maintains the desired gradient search property. To ensure proper
convergence behavior, their results also suggest some restrictions on the network
connection weights. That is, parameter estimates are projected into some
"stability" region whenever they violate the imposed constraints. Thus, much more
effort is needed in programming appropriate learning algorithms for recurrent
networks. Moreover, some of their conditions to ensure convergence of the recurrent
BP algorithm are rather stringent.
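As a rough illustration of where the additional derivative updating equations come from, the chain rule can be written out for the generic network. The shorthand $F(x_t, r_t, \theta)$ for the right-hand side of (1) is introduced here only for this sketch, and the display is not a reproduction of the algorithm of Kuan, Hornik, & White (1990).

```latex
\frac{d o_t}{d\theta'}
  = \nabla_{\theta'} F(x_t, r_t, \theta)
  + \nabla_{r'} F(x_t, r_t, \theta)\, \frac{d r_t}{d\theta'},
\qquad
\frac{d r_t}{d\theta'}
  = \nabla_{\theta'} G(x_{t-1}, r_{t-1}, \theta)
  + \nabla_{r'} G(x_{t-1}, r_{t-1}, \theta)\, \frac{d r_{t-1}}{d\theta'}.
```

Standard BP keeps only the first term of the first equation. The second term requires carrying $d r_t / d\theta'$ forward in time through the second equation, which is the extra bookkeeping a fully recurrent algorithm must perform.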




3      A  Partially  Hard-Wired  Elman  Network


In  this  section  we  suggest  an  easier  way  to  implement  a  variant  of  the  Elman 
(1988)  network.  As  we  have  discussed  in  section  2,  improper  convergence  of  the 
learning  algorithms  is  mainly  due  to  the  dependence  of  the  internal  inputs  rt  on 
the  modifiable  network  parameters.  To  circumvent  this  problem,  we  propose  to 
modify  the  Elman  network  as  is  depicted  in  figure  1. 
Figure 1. The proposed partially hard-wired recurrent network. Modifiable and
hard-wired connections are represented by → and ⇒, respectively.


The hidden units are partitioned into two groups containing $l_f$ and $l_r = l - l_f$
units, respectively, and only the units in the second group serve as recurrent units.
Intuitively, the units in the first group play the standard role in artificial neural
networks, whereas the task of the recurrent units is to "index" information on
previous inputs. Furthermore, the connections between the recurrent units and
their inputs are hard-wired.

Hence, $a$ is partitioned as $a = [a_f', a_r']'$, where $a_f$ is the $l_f \times 1$ vector of
activations of the (purely feedforward) hidden units in the first group, and $a_r$ is the
$l_r \times 1$ vector of activations of the feedback (recurrent) hidden units in the second
group such that

$$r_t = a_{r,t-1}.$$

If the connection matrices $C$ and $D$ and the bias vector $b$ are partitioned
conformably as

$$C = \begin{bmatrix} C_f \\ C_r \end{bmatrix}, \qquad D = \begin{bmatrix} D_f \\ D_r \end{bmatrix}, \qquad b = \begin{bmatrix} b_f \\ b_r \end{bmatrix},$$

then

$$a_{f,t} = \Psi(C_f x_t + D_f a_{r,t-1} + b_f)$$

$$a_{r,t} = \Psi(C_r x_t + D_r a_{r,t-1} + b_r),$$

where now $C_r$, $D_r$ and $b_r$ are fixed due to hard-wiring. Different choices of $C_r$,
$D_r$ and $b_r$ determine how the past information is represented; hence they
are problem-dependent and are left to the researcher.
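Because the recurrent block uses only fixed weights, its activations can be generated from the inputs alone, before any learning takes place. The following is a minimal sketch under stated assumptions: the function name `recurrent_states`, the choice $\Psi = \tanh$, and the initial activation $a_{r,-1} = 0$ are illustrative and not prescribed by the paper.

```python
import math

def recurrent_states(xs, C_r, D_r, b_r, act=math.tanh):
    """Iterate a_{r,t} = Psi(C_r x_t + D_r a_{r,t-1} + b_r) with fixed weights.

    Assumption for illustration: a_{r,-1} = 0, so that r_0 = 0.
    Returns [r_0, r_1, ..., r_T], where r_t = a_{r,t-1}.
    """
    a_r = [0.0] * len(b_r)
    rs = [a_r]
    for x in xs:
        a_r = [
            act(sum(c * x_j for c, x_j in zip(C_r[i], x))
                + sum(d * a_j for d, a_j in zip(D_r[i], a_r))
                + b_r[i])
            for i in range(len(b_r))
        ]
        rs.append(a_r)
    return rs
```

No learnable weight appears anywhere in this function, so the sequence it returns can be treated as an additional set of inputs to the feedforward part of the network.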

Hence, writing the proposed network in a nonlinear functional form, we have

$$o_t = \Phi(W \Psi(C x_t + D a_{r,t-1} + b) + v) =: F(x_t, a_{r,t-1}, \theta) \qquad (3)$$

and

$$r_t = a_{r,t-1} = \Psi(C_r x_{t-1} + D_r a_{r,t-2} + b_r) =: G(x_{t-1}, a_{r,t-2}, \bar\theta), \qquad (4)$$

where now

$$\theta = [\mathrm{vec}(W)', \mathrm{vec}(C_f)', \mathrm{vec}(D_f)', b_f', v']'$$

is the $p \times 1$ vector which contains all the learnable network weights, where
$p := m(l + 1) + l_f(k + l_r + 1)$, and

$$\bar\theta = [\mathrm{vec}(C_r)', \mathrm{vec}(D_r)', b_r']'$$

contains all the hard-wired weights. By recursive substitution, (4) becomes

$$r_t = G(x_{t-1}, r_{t-1}, \bar\theta) = G(x_{t-1}, G(x_{t-2}, r_{t-2}, \bar\theta), \bar\theta) = \cdots =: \mathcal{G}(x^{t-1}, \bar\theta),$$

cf. equation (2). Thus, $r_t = a_{r,t-1}$ is a function of the entire past of $x_t$ and the
hard-wired weights $\bar\theta$.
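As a quick check on the bookkeeping, the number of learnable and hard-wired weights can be counted directly from the layer sizes. The helper below is a hypothetical illustration; the count of hard-wired weights, $l_r(k + l_r + 1)$, is obtained here from the dimensions of $C_r$, $D_r$ and $b_r$.

```python
def count_weights(k, l_f, l_r, m):
    """Return (p, p_fixed) for k inputs, l_f feedforward hidden units,
    l_r recurrent hidden units and m outputs."""
    l = l_f + l_r
    # W is m x l and v is m x 1; C_f is l_f x k, D_f is l_f x l_r, b_f is l_f x 1.
    p = m * (l + 1) + l_f * (k + l_r + 1)
    # C_r is l_r x k, D_r is l_r x l_r, b_r is l_r x 1.
    p_fixed = l_r * (k + l_r + 1)
    return p, p_fixed
```

For example, with $k = 3$, $l_f = 4$, $l_r = 2$ and $m = 1$ this gives $p = 7 + 24 = 31$ learnable weights and 12 hard-wired weights.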

Because $r_t$ is not a function of the learnable weights $\theta$, the aforementioned
problem of parameter dependence is avoided. It follows that the standard
BP algorithm for feedforward networks is applicable to the proposed network with
respect to the learnable weights $\theta$. Letting $y_t$ denote the target pattern presented
at time $t$, the BP algorithm is

$$\theta_{t+1} = \theta_t + \eta_t \nabla_\theta F(x_t, r_t, \theta_t)(y_t - F(x_t, r_t, \theta_t)), \qquad (5)$$

where $\eta_t$ is the learning rate employed at time $t$ and $\nabla_\theta F$ is the matrix of partial
derivatives of $F$ with respect to the components of $\theta$. However, in both theory and
practice it is necessary to keep the BP estimates in some compact subset $\Theta$ of $\mathbb{R}^p$,
thus preventing the entries from becoming extremely large. This is a typical
requirement in the convergence analysis of BP type algorithms; see, e.g.,
Kuan & White (1990) and Kuan, Hornik, & White (1990). If it is not automatically
guaranteed by the algorithm, it can be accomplished by applying to the BP estimates
a projection operator $\pi$ which maps $\mathbb{R}^p$ onto $\Theta$. Usually, a truncation device is
convenient for this purpose. This requirement entails little loss because it is usually
inactive when very large truncation bounds are imposed.
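One step of (5) followed by a truncation device can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the output is scalar ($m = 1$), the toy network `toy_F` with a single tanh unit stands in for the full $F$, and componentwise clipping to `[-bound, bound]` is used as the projection.

```python
import math

def truncate(theta, bound):
    """Clip every component to [-bound, bound] (a simple projection onto a box)."""
    return [max(-bound, min(bound, th)) for th in theta]

def bp_step(theta, x, r, y, eta, F, grad_F, bound=1e3):
    """theta + eta * grad_F * (y - F), then truncation; scalar output assumed."""
    err = y - F(x, r, theta)
    g = grad_F(x, r, theta)
    return truncate([th + eta * g_i * err for th, g_i in zip(theta, g)], bound)

# Hypothetical toy network: F = tanh(theta_0 * x + theta_1 * r + theta_2).
def toy_F(x, r, theta):
    return math.tanh(theta[0] * x + theta[1] * r + theta[2])

def toy_grad_F(x, r, theta):
    f = toy_F(x, r, theta)
    return [(1.0 - f * f) * z for z in (x, r, 1.0)]
```

The internal input `r` is passed in exactly like the external input `x`, which is the only change relative to a feedforward BP step. With a large `bound` the truncation never binds in this toy example.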


In light of (5), we only have to modify the existing BP algorithm slightly to
incorporate the internal inputs $a_r$ into the algorithm. Furthermore, if a fixed
training data set is given, the internal inputs $a_{r,t}$ can be calculated first, and
off-line learning methods such as nonlinear least squares can then be applied to
estimate the learnable weights $\theta$. These advantages allow researchers to adapt to
recurrent networks quite easily. It is then interesting to know the properties of the
algorithm (5) applied to the proposed network given by (3) and (4). This is the
topic to which we now turn.
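The two-stage off-line procedure can be sketched in a few lines. Everything here is an illustrative assumption: the network is scalar, the hard-wired recursion has no bias, the learnable part is a single tanh unit, and plain batch gradient descent on the mean squared error is used as a simple stand-in for a nonlinear least squares routine.

```python
import math

def internal_inputs(xs, c_r, d_r):
    """Stage 1: r_0 = 0 (assumed), r_{t+1} = tanh(c_r x_t + d_r r_t), fixed weights."""
    rs = [0.0]
    for x in xs[:-1]:
        rs.append(math.tanh(c_r * x + d_r * rs[-1]))
    return rs

def predict(theta, x, r):
    """Toy learnable part: F = tanh(c x + d r + b) with theta = (c, d, b)."""
    c, d, b = theta
    return math.tanh(c * x + d * r + b)

def mse(theta, xs, rs, ys):
    return sum((y - predict(theta, x, r)) ** 2
               for x, r, y in zip(xs, rs, ys)) / len(xs)

def fit_offline(xs, rs, ys, steps=500, lr=0.2):
    """Stage 2: batch gradient descent on half the mean squared error."""
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        direction = [0.0, 0.0, 0.0]  # negative gradient, averaged over the sample
        for x, r, y in zip(xs, rs, ys):
            f = predict(theta, x, r)
            scale = (y - f) * (1.0 - f * f) / len(xs)
            for i, z in enumerate((x, r, 1.0)):
                direction[i] += scale * z
        theta = [th + lr * d_i for th, d_i in zip(theta, direction)]
    return theta
```

The internal inputs are computed once in `internal_inputs` and then reused unchanged in every pass over the data, which is what makes batch estimation possible for this architecture.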

4      Asymptotic  Properties  of  the  BP  Algorithm 

Let $\{V_t\}$ be some sequence of random variables defined on a probability space
$(\Omega, \mathcal{F}, P)$, let $\mathcal{F}_\tau^t$ be the $\sigma$-algebra generated by $V_\tau, V_{\tau+1}, \ldots, V_t$, and let
$\{Z_t\}$ be a sequence of square integrable random variables on that probability space. We
write $E_{t-m}^{t+m}(Z_t)$ for the conditional expectation $E(Z_t \mid \mathcal{F}_{t-m}^{t+m})$ and $\|\cdot\|$ for the norm
in $L_2(P)$, i.e., $\|Z\| = (E|Z|^2)^{1/2}$.

Definition 4.1. Let

$$\nu_m := \sup_t \|Z_t - E_{t-m}^{t+m}(Z_t)\|.$$

Then $\{Z_t\}$ is near epoch dependent (NED) on $\{V_t\}$ of size $-a$ if for some $\lambda < -a$,
$\nu_m = O(m^\lambda)$ as $m \to \infty$.
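A simple worked example may help fix ideas. It is not taken from the paper and assumes that $\{V_t\}$ is i.i.d. with mean zero and variance $\sigma_V^2$, and that $Z_t$ is the linear process below with $|\rho| < 1$.

```latex
Z_t = \sum_{j=0}^{\infty} \rho^j V_{t-j}
\;\Longrightarrow\;
E_{t-m}^{t+m}(Z_t) = \sum_{j=0}^{m} \rho^j V_{t-j},
\qquad
\nu_m = \Bigl\| \sum_{j=m+1}^{\infty} \rho^j V_{t-j} \Bigr\|
      = \frac{\sigma_V\, |\rho|^{m+1}}{(1 - \rho^2)^{1/2}} .
```

Since $\nu_m$ decays geometrically, it is $O(m^\lambda)$ for every $\lambda < 0$, so in this example $\{Z_t\}$ is NED on $\{V_t\}$ of size $-a$ for every $a > 0$.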

This definition conveys the idea that a random variable depends essentially on
the information generated by "more or less current" $V_t$ and does not depend too
much on the information contained in the distant future or past. The larger the
magnitude of the size, the faster $\nu_m$ decays, that is, the faster the dependence on
remote information dies out. More details on near epoch dependence can be found in
Billingsley (1968), McLeish (1975), and Gallant & White (1988).

The lemma below ensures that recurrent variables are well behaved and do not
have too long a memory.

Lemma 4.2. Let $\{r_t\}$ be generated by (4), where $\{x_t\}$ is NED on $\{V_t\}$ of size $-a$
and the common hidden unit activation function $\psi$ is bounded and continuously
differentiable with bounded first derivative. If $|\mathrm{vec}(D_r)| < M_\psi^{-1}$, where
$M_\psi := \sup_{\sigma \in \mathbb{R}} |\psi'(\sigma)|$, then $\{r_t\}$ is a bounded sequence NED on $\{V_t\}$ of size $-a$.
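The condition of the lemma is easy to check for a candidate hard-wired matrix, since $|\mathrm{vec}(D_r)|$ is the Frobenius norm of $D_r$. The sketch below uses the known derivative bounds of two common squashers, $1/4$ for the logistic function and $1$ for the hyperbolic tangent. The function and dictionary names are hypothetical.

```python
import math

# sup over sigma of |psi'(sigma)| for two common activation functions.
M_PSI = {"logistic": 0.25, "tanh": 1.0}

def frobenius_norm(D):
    """Euclidean length of vec(D)."""
    return math.sqrt(sum(d * d for row in D for d in row))

def satisfies_lemma_condition(D_r, activation="tanh"):
    """True if |vec(D_r)| < 1 / M_psi for the chosen activation."""
    return frobenius_norm(D_r) < 1.0 / M_PSI[activation]
```

For instance, a matrix with all four entries equal to 0.5 has norm exactly 1, which violates the strict inequality for tanh units but satisfies it for logistic units, where the bound is 4.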

Remark 1. Notice that if the input data $\{x_t\}$ form a sequence of independent
random variables (which is a special case of an NED sequence), then $\{r_t\}$ need not
necessarily be mixing but is NED on $\{x_t\}$ of arbitrarily large size; see Gallant &
White (1988, pp. 27-31). Hence, introducing the concept of near epoch dependence
is not a technical triviality, but a necessity when dealing with feedback networks
in stochastic input environments.

In what follows we compactly write the algorithm (5) as

$$\theta_{t+1} = \theta_t + \eta_t h_t(\theta_t),$$

where $z_t = (y_t', x_t')'$ and $h_t(\theta) = \nabla_\theta F(x_t, r_t, \theta)(y_t - F(x_t, r_t, \theta))$. Our consistency
result is based on the ordinary differential equation (ODE) method of Kushner &
Clark (1978), cf. Ljung (1977). This approach is now well known in analyzing
neural network learning algorithms; cf., e.g., Oja (1982), Oja & Karhunen (1985),
Sanger (1989), Kuan & White (1990), Kuan, Hornik, & White (1990), and Hornik
& Kuan (1990).

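We need the following notation. Let $\tau_0 = 0$ and, for $t \ge 1$, let $\tau_t := \sum_{i=0}^{t-1} \eta_i$.
The piecewise linear interpolation of $\{\theta_t\}$ with interpolation intervals $\{\eta_t\}$ is

$$\theta^0(\tau) = \left(\frac{\tau_{t+1} - \tau}{\eta_t}\right) \theta_t + \left(\frac{\tau - \tau_t}{\eta_t}\right) \theta_{t+1}, \qquad \tau \in [\tau_t, \tau_{t+1}),$$

and for each $t$, its "left shift" is

$$\theta^t(\tau) = \theta^0(\tau_t + \tau), \qquad \tau \ge 0.$$

Observe in particular that $\theta^t(0) = \theta^0(\tau_t) = \theta_t$.

We impose the following conditions.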

A.1. $\{V_t\}$ and $\{z_t\}$ are defined on a complete probability space $(\Omega, \mathcal{F}, P)$ such
that for some $r > 4$,

(i) $\{V_t\}$ is a mixing sequence with mixing coefficients $\phi_m$ of size $-r/(2(r-1))$
or $\alpha_m$ of size $-r/(r-2)$, and

(ii) the sequence $\{z_t\}$ is NED on $\{V_t\}$ of size $-1$ with $\sup_t |x_t| < M_x < \infty$
and $\sup_t E(|y_t|^r) < \infty$.

A.2. For the network architecture as specified in (3) and (4),

(i) $\phi$ and $\psi$ are continuously differentiable of order 3. $\psi$ is bounded and
has bounded first order derivative.

(ii) $|\mathrm{vec}(D_r)| < M_\psi^{-1}$, where $M_\psi = \sup_{\sigma \in \mathbb{R}} |\psi'(\sigma)|$.

A.3. $\{\eta_t\}$ is a sequence of positive real numbers such that $\sum_t \eta_t = \infty$ and
$\sum_t \eta_t^2 < \infty$.

A.4. For each $\theta \in \Theta$, $h(\theta) = \lim_t E(h_t(\theta))$ exists.

A.1 allows the data to exhibit a considerable amount of dependence in the sense
that they are functions of the (possibly infinite) history of an underlying mixing
sequence. For more details on $\alpha$- and $\phi$-mixing sequences we refer to White (1984).
Assuming that the external inputs $x_t$ are uniformly bounded simplifies some
technicalities needed to establish convergence and causes no loss of generality, as
pointed out by Kuan & White (1990). Desired generality is assured by
allowing the $y_t$ sequence to be unbounded. Note that typical choices for $\psi$ such as
the logistic squasher and hyperbolic tangent squasher satisfy A.2(i). Condition
A.2(ii) is needed in lemma 4.2 and is the constraint suggested by Kuan, Hornik, &
White (1990) for general recurrent networks. A.3 is a typical restriction on the
learning rates for BP types of algorithms. For example, learning rates of order $1/t$
satisfy this condition. A.4 is needed to define the associated ODE whose solution
trajectory is the limiting path of the interpolated processes $\{\theta^t(\cdot)\}$.
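The interpolated process is straightforward to compute for a scalar sequence of estimates. The sketch below assumes learning rates $\eta_t = 1/(t+1)$ indexed from $t = 0$, interpolation times $\tau_0 = 0$ and $\tau_{t+1} = \tau_t + \eta_t$, and linear interpolation between successive estimates on $[\tau_t, \tau_{t+1})$. The function names are hypothetical.

```python
def learning_rates(T):
    """eta_t = 1 / (t + 1) for t = 0, ..., T - 1 (one choice satisfying A.3)."""
    return [1.0 / (t + 1) for t in range(T)]

def interpolation_times(etas):
    """tau_0 = 0 and tau_{t+1} = tau_t + eta_t."""
    taus = [0.0]
    for eta in etas:
        taus.append(taus[-1] + eta)
    return taus

def interpolate(thetas, etas, tau):
    """Piecewise linear interpolation of scalar estimates thetas[0..T] at time tau."""
    taus = interpolation_times(etas)
    for t, eta in enumerate(etas):
        if taus[t] <= tau < taus[t + 1]:
            w = (tau - taus[t]) / eta  # weight on theta_{t+1}
            return (1.0 - w) * thetas[t] + w * thetas[t + 1]
    raise ValueError("tau lies outside the interpolation range")
```

Because the intervals shrink like $1/(t+1)$, later estimates are packed ever more densely on the $\tau$ axis, which is the time scale on which the limiting ODE is defined.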

The result below follows from corollary 3.5 of Kuan & White (1990).

Theorem 4.3. For the network given by (3) and (4) and the algorithm (5),
suppose that assumptions A.1-A.4 hold. Then

(a) $\{\theta^t(\cdot)\}$ is bounded and equicontinuous on bounded intervals with probability
one, and all limits of convergent subsequences satisfy the ODE $\dot\theta = h(\theta)$.

(b) Let $\Theta^*$ be the set of all (locally) asymptotically stable equilibria of this ODE
contained in $\Theta$, and let $\mathcal{D}(\Theta^*) \subset \mathbb{R}^p$ be the domain of attraction of $\Theta^*$.
Then, if $\theta_t$ enters a compact subset of $\mathcal{D}(\Theta^*)$ infinitely often with probability
one, and thus in particular, if $\Theta \subset \mathcal{D}(\Theta^*)$, then with probability one, $\theta_t \to \Theta^*$
as $t \to \infty$.

Remark 2. Because the elements $\theta^*$ of $\Theta^*$ solve the equation $\lim_t E(h_t(\theta)) =
h(\theta) = 0$, they (locally) minimize

$$\lim_t E|y_t - F(x_t, r_t, \theta)|^2. \qquad (6)$$

Theorem 4.3 thus shows that the BP estimates can converge to a mean squared
error minimizer with probability one. Note however that this convergence occurs
conditional on $\bar\theta$.

Remark 3. By the Toeplitz lemma, $\lim_T T^{-1} \sum_{t=1}^{T} E|y_t - F(x_t, r_t, \theta)|^2$ is the
same as (6). Therefore, the (on-line) BP estimates converge to the same limit as
the (off-line) nonlinear least squares estimator.

Remark  4.  As  yt  is  not  required  to  be  bounded,  our  strong  consistency  result 
holds  under  less  stringent  conditions  than  those  of  Kuan,  Hornik  &  White  (1990) 
for  the  fully  recurrent  BP  algorithm. 

To establish asymptotic normality we consider the algorithm (5) with the
specific choice $\eta_t = (t+1)^{-1}$. (Note that no limiting distribution results for BP
estimators in recurrent networks have been published thus far; in particular, Kuan,
Hornik, & White (1990) give only a consistency result for their recurrent BP
algorithm.) Let $U_t := (t+1)^{1/2}(\theta_t - \theta^*)$ be the sequence of normalized estimates.
The piecewise constant interpolation of $U_t$ on $[0, \infty)$ with interpolation intervals
$\{(t+1)^{-1}\}$ is defined as

$$U(\tau) = U_t, \qquad \tau \in [\tau_t, \tau_{t+1}),$$

and again, for each $t$ its "left shift" is defined as

$$U^t(\tau) = U(\tau_t + \tau), \qquad \tau \ge 0.$$

Finally, let

$$H(\theta) := \lim_t E[\nabla_\theta h_t(\theta)] + I_p/2,$$

where $I_p$ is the $p$-dimensional identity matrix.

Our result follows from the stochastic differential equation (SDE) approach of
Kushner & Huang (1979). In contrast with the ODE approach, the interpolated
processes can now be shown to converge weakly to the solution paths of a
corresponding SDE with respect to the Skorohod topology. For more details on weak
convergence we refer to Billingsley (1968). The following conditions suffice for the
asymptotic normality result.

B.1. A.1(i) holds, and $\{z_t\}$ is a stationary sequence NED on $\{V_t\}$ of size $-8$ with
$\sup_t |x_t| < M_x < \infty$ and $\sup_t E(|y_t|^8) < \infty$.

B.2. A.2 holds with $\phi$ and $\psi$ continuously differentiable of order 4.

B.3. $\theta^* \in \mathrm{int}(\Theta)$ is such that $h(\theta^*) = 0$ and all eigenvalues of $H(\theta^*)$ have negative
real parts.

The result below follows from corollary 3.6 of Kuan & White (1990).

Theorem 4.4. Consider the network given by (3) and (4) and the algorithm (5)
with $\eta_t = (t+1)^{-1}$, suppose that assumptions B.1-B.3 hold and that with probability
one, $\theta_t \to \theta^*$ as $t \to \infty$. Then $\{U^t(\cdot)\}$ converges weakly to the stationary solution
of the stochastic differential equation

$$dU(\tau) = H(\theta^*) U(\tau)\, d\tau + \Sigma(\theta^*)^{1/2}\, dW(\tau),$$

where $W$ denotes the standard $p$-variate Wiener process and

$$\Sigma(\theta^*) := \lim_t \sum_{j=-\infty}^{\infty} E[h_t(\theta^*) h_{t+j}(\theta^*)'].$$

In particular,

$$(t+1)^{1/2}(\theta_t - \theta^*) \xrightarrow{D} N(0, S(\theta^*)),$$

where $\xrightarrow{D}$ signifies convergence in distribution and

$$S(\theta^*) := \int_0^\infty \exp(H(\theta^*) s)\, \Sigma(\theta^*)\, \exp(H(\theta^*)' s)\, ds$$

is the unique solution to the matrix equation $H(\theta^*) S + S H(\theta^*)' = -\Sigma(\theta^*)$.

Remark 5. If $\eta_t = (t+1)^{-1} R$, where $R$ is a nonsingular $p \times p$ matrix, the
SDE in theorem 4.4 becomes $dU(\tau) = H(\theta^*) U(\tau)\, d\tau + R\, \Sigma(\theta^*)^{1/2}\, dW(\tau)$, and the
covariance matrix of the asymptotic distribution of $\theta_t$ becomes $R S(\theta^*) R'$.

Remark 6. If the probability that $\theta_t$ converges to $\theta^*$ is positive, but less than one,
the above theorem provides the limiting distribution conditional on convergence to
$\theta^*$. Hence, if $\Theta^*$ contains only finitely many points, assumption B.3 is satisfied for
each $\theta^* \in \Theta^*$, and $\theta_t$ converges with probability one to one of the elements of $\Theta^*$,
then the asymptotic distribution of $\theta_t$ is a mixture of $N(\theta^*, S(\theta^*))$ distributions,
weighted relative to the convergence probabilities.
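The scalar case makes the asymptotic variance explicit. The following worked computation is only an illustration; it assumes $p = 1$, writes $H$ for the scalar $H(\theta^*) < 0$ and $\Sigma$ for the long-run variance of $h_t(\theta^*)$, and uses the integral representation of the asymptotic variance $S$ together with the matrix equation $HS + SH' = -\Sigma$.

```latex
S = \int_0^\infty e^{Hs}\, \Sigma\, e^{Hs}\, ds
  = \Sigma \int_0^\infty e^{2Hs}\, ds
  = -\frac{\Sigma}{2H},
\qquad
\text{so that}\quad HS + SH = 2HS = -\Sigma .
```

For example, if $\lim_t E[\nabla_\theta h_t(\theta^*)] = -1$, then $H = -1 + 1/2 = -1/2$ and $S = \Sigma$.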

5      Conclusions 

In  this  paper  we  propose  a  partially  hard-wired  Elman  network,  in  which  only 
a  subset  of  hidden-unit  activations  is  allowed  to  feed  back  into  the  network  and 
connections  between  these  hidden  units  and  input  layer  are  hard-wired.  A  distinct 
feature  of  our  approach  is  that  existing  on-line  and  off-line  learning  algorithms 
can  be  slightly  modified  to  implement  the  proposed  network.  (Note  that  off-line 
learning  is  not  possible  for  a  fully  learnable  recurrent  network.)  This  is  particularly 
convenient  for  researchers.  Our  results  also  show  that  the  estimates  from  the 
standard  BP  algorithm  adapted  to  this  network  can  converge  to  a  mean  squared 
error  minimizer  with  probability  one  and  are  asymptotically  normally  distributed. 
These  asymptotic  properties  are  analogous  to  those  of  the  standard  and  recurrent 
BP  algorithms. 

As the convergence results in this paper are conditional on the hard-wired
connection weights $\bar\theta$, the resulting weight estimates are not fully optimal, in
contrast with fully learnable recurrent networks. To improve the performance of the
proposed network, one can train the network with various hard-wired connection
weights and search for the best performing architecture.




Appendix 


Lemma A. Let $\{x_t\}$ be NED on $\{V_t\}$ of size $-a$ and let the square integrable
sequence $\{r_t\}$ be generated by the recursion

$$r_t = G(x_{t-1}, r_{t-1}, \bar\theta).$$

Suppose that $G(\cdot, r, \bar\theta)$ satisfies a Lipschitz condition uniformly in $r$, i.e., there
exists a finite constant $L$ such that for all $r$,

$$|G(x_1, r, \bar\theta) - G(x_2, r, \bar\theta)| \le L |x_1 - x_2|,$$

and that $G(x, \cdot, \bar\theta)$ is a contraction mapping uniformly in $x$, i.e., there exists some
$\rho < 1$ such that for all $x$,

$$|G(x, r_1, \bar\theta) - G(x, r_2, \bar\theta)| \le \rho |r_1 - r_2|.$$

Then $\{r_t\}$ is NED on $\{V_t\}$ of size $-a$.
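Proof. We first observe that

$$\begin{aligned}
\|r_t - E_{t-m}^{t+m}(r_t)\|
&= \|G(x_{t-1}, r_{t-1}, \bar\theta) - E_{t-m}^{t+m}(G(x_{t-1}, r_{t-1}, \bar\theta))\| \\
&\le \|G(x_{t-1}, r_{t-1}, \bar\theta) - G(E_{t-m}^{t+m-2}(x_{t-1}), E_{t-m}^{t+m-2}(r_{t-1}), \bar\theta)\| \\
&\le \|G(x_{t-1}, r_{t-1}, \bar\theta) - G(E_{t-m}^{t+m-2}(x_{t-1}), r_{t-1}, \bar\theta)\| \\
&\quad + \|G(E_{t-m}^{t+m-2}(x_{t-1}), r_{t-1}, \bar\theta) - G(E_{t-m}^{t+m-2}(x_{t-1}), E_{t-m}^{t+m-2}(r_{t-1}), \bar\theta)\| \\
&\le L \|x_{t-1} - E_{t-m}^{t+m-2}(x_{t-1})\| + \rho \|r_{t-1} - E_{t-m}^{t+m-2}(r_{t-1})\|,
\end{aligned}$$

where the first inequality follows from the fact that $E_{t-m}^{t+m}(G(x_{t-1}, r_{t-1}, \bar\theta))$ is the
best mean square predictor of $G(x_{t-1}, r_{t-1}, \bar\theta)$ among all $\mathcal{F}_{t-m}^{t+m}$-measurable
functions, the second inequality follows from the triangle inequality, and the third
follows from the Lipschitz and contraction conditions. Hence, we obtain

$$\nu_{r,m} \le L \nu_{x,m-1} + \rho \nu_{r,m-1}, \qquad (a1)$$

where $\nu_{x,m}$ and $\nu_{r,m}$ are the NED coefficients for $\{x_t\}$ and $\{r_t\}$, respectively. We
must show that for some $\lambda < -a$, $\nu_{r,m}$ is $O(m^\lambda)$ as $m \to \infty$. Because $\{x_t\}$ is NED
on $\{V_t\}$ of size $-a$, we can find a finite constant $C_0$ and some $\lambda_0 < -a$ such that
$\nu_{x,m} \le C_0 m^{\lambda_0}$. By the fact that $\rho < 1$, we can find $m_0$ and some $\sigma > 1$ such that
$\rho\sigma < 1$ and for all $m \ge m_0$,

$$(m/(m+1))^{\lambda_0} \le \sigma.$$

Let

$$D_0 := \max\left\{ \frac{\nu_{r,m_0}}{m_0^{\lambda_0}},\ \frac{C_0 L \sigma}{1 - \rho\sigma} \right\}.$$

We now prove by induction that for all $m \ge m_0$, $\nu_{r,m} \le D_0 m^{\lambda_0}$. For $m = m_0$,
this is trivially true by the definition of $D_0$. Suppose we have already shown that
for some $m \ge m_0$, $\nu_{r,m} \le D_0 m^{\lambda_0}$. Then, using (a1),

$$\begin{aligned}
\nu_{r,m+1} &\le L \nu_{x,m} + \rho \nu_{r,m} \\
&\le L C_0 m^{\lambda_0} + \rho D_0 m^{\lambda_0} \\
&= (L C_0 + \rho D_0)(m+1)^{\lambda_0} (m/(m+1))^{\lambda_0} \\
&\le (L C_0 + \rho D_0)\, \sigma\, (m+1)^{\lambda_0} \\
&\le D_0 (m+1)^{\lambda_0},
\end{aligned}$$

completing the induction step and thus the proof of the lemma.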
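The induction can be checked numerically for particular constants. The sketch below takes the recursion (a1) with equality as a worst case and assumes $\nu_{x,m} = C_0 m^{\lambda_0}$ exactly. All default values, the starting value `nu_r1`, and the function name are hypothetical choices for illustration.

```python
def ned_bound_check(L=2.0, rho=0.5, C0=1.0, lam0=-1.5, sigma=1.5,
                    nu_r1=1.0, M=200):
    """Iterate nu_{r,m+1} = L*C0*m**lam0 + rho*nu_{r,m} and test the bound
    nu_{r,m} <= D0 * m**lam0 for m0 <= m <= M."""
    assert sigma > 1.0 and rho * sigma < 1.0
    # Smallest m0 with (m/(m+1))**lam0 <= sigma; the ratio decreases in m.
    m0 = 1
    while (m0 / (m0 + 1)) ** lam0 > sigma:
        m0 += 1
    nu_r = {1: nu_r1}
    for m in range(1, M):
        nu_r[m + 1] = L * C0 * m ** lam0 + rho * nu_r[m]
    D0 = max(nu_r[m0] / m0 ** lam0, C0 * L * sigma / (1.0 - rho * sigma))
    ok = all(nu_r[m] <= D0 * m ** lam0 for m in range(m0, M + 1))
    return m0, D0, ok
```

With the default values the search stops at $m_0 = 4$, the constant is $D_0 = 12$, and the bound holds over the whole range that is checked.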

Proof of Lemma 4.2. By boundedness of $\psi$, the sequence $\{r_t\}$ generated by
(4) is bounded and thus trivially square integrable. Hence, in view of the above
lemma A, it suffices to show that $G$ is Lipschitz continuous in $x$ and a contraction
mapping in $r$. As by assumption the first derivative of $\psi$ is uniformly bounded, $G$
is clearly Lipschitz continuous in $x$ with Lipschitz constant $L = M_\psi |C_r|$. (If $A$ is
a matrix, then $|A| := \max\{|Ax| : |x| = 1\}$.) Similarly, let $\nabla_r G$ denote the matrix
of partial derivatives of $G$ with respect to $r$. Note that $|\nabla_r G(x, r, \bar\theta)|$ is the largest
singular value of $\nabla_r G$, i.e., the square root of the largest eigenvalue of
$\nabla_r G \nabla_r G'$, and thus by a well-known result from linear algebra,

$$\begin{aligned}
|\nabla_r G(x, r, \bar\theta)| &\le (\mathrm{trace}(\nabla_r G(x, r, \bar\theta) \nabla_r G(x, r, \bar\theta)'))^{1/2} \\
&\le M_\psi (\mathrm{trace}(D_r D_r'))^{1/2} \\
&= M_\psi |\mathrm{vec}(D_r)| \\
&=: \rho.
\end{aligned}$$

By assumption, $\rho < 1$. As clearly

$$|G(x, r_1, \bar\theta) - G(x, r_2, \bar\theta)| \le \sup_r |\nabla_r G(x, r, \bar\theta)|\, |r_1 - r_2| \le \rho |r_1 - r_2|,$$

$G$ is a contraction mapping in $r$, thereby completing the proof of lemma 4.2.
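The contraction property has a direct practical meaning: two runs of the hard-wired recursion that receive the same inputs but start from different initial values approach each other geometrically. A minimal scalar sketch follows, assuming $\psi = \tanh$ (so $M_\psi = 1$ and $\rho = |d_r|$); the function names and default weights are hypothetical.

```python
import math

def G(x, r, c_r=1.0, d_r=0.6, b_r=0.0):
    """Scalar hard-wired recursion: tanh(c_r * x + d_r * r + b_r)."""
    return math.tanh(c_r * x + d_r * r + b_r)

def gap_history(xs, r_a, r_b, d_r=0.6):
    """Distance between two trajectories driven by the same inputs xs."""
    gaps = [abs(r_a - r_b)]
    for x in xs:
        r_a, r_b = G(x, r_a, d_r=d_r), G(x, r_b, d_r=d_r)
        gaps.append(abs(r_a - r_b))
    return gaps
```

Each entry of the returned list is at most $\rho = 0.6$ times the previous one, so the influence of the initial condition on $r_t$ vanishes at a geometric rate.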

Proof of theorem 4.3. We verify the conditions of corollary 3.5 of Kuan &
White (1990), which we shall briefly refer to as [KW]. Their conditions A.4 and
C.3 are explicitly assumed (our assumptions A.3 and A.4). It follows from lemma
4.2 that $\{r_t\}$ and thus also $\{\xi_t\}$ are bounded sequences NED on $\{V_t\}$ of size
$-1$, where $\xi_t = [x_t', r_t']'$, which establishes condition C.1 of [KW]. Let $M_\xi$ be an
upper bound for the sequence $\{\xi_t\}$, and let $K_\xi := \{\xi : |\xi| \le M_\xi\}$. Condition C.2
of [KW] requires that in $K_\xi \times \Theta$, both $F(\xi, \cdot)$ and $\nabla_\theta F(\xi, \cdot)$ satisfy a Lipschitz
condition with Lipschitz constants $L_1(\xi)$ and $L_2(\xi)$, respectively, where $L_1$ and
$L_2$ are Lipschitz continuous in $\xi$, and that both $F(\cdot, \theta)$ and $\nabla_\theta F(\cdot, \theta)$ satisfy a
Lipschitz condition. It is straightforward to show that continuous differentiability
of A.2(i) ensures these Lipschitz conditions. See also corollary 4.1 of Kuan &
White (1990).

Proof of theorem 4.4. We verify the conditions of corollary 3.6 of [KW]. Lemma
4.2 ensures that $\{r_t\}$ is NED on $\{V_t\}$ of size $-8$. Stationarity of $\{x_t\}$ implies that
$\{r_t\}$ is also stationary. Hence, $\{\xi_t\}$ is a stationary sequence NED on $\{V_t\}$ of size
$-8$, which establishes condition D.1 of [KW]. Condition D.2 of [KW] follows from
B.3 and the moment condition of B.1. Finally, as in the preceding proof, four
times continuous differentiability of B.2 ensures the Lipschitz conditions imposed
in condition D.3 of [KW]. See also corollary 4.2 of Kuan & White (1990).